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Abstract. We present a fast direct solver for structured linear systems based on multilevel 
matrix compression. Using the recently developed interpolative decomposition of a low-rank matrix 
in a recursive manner, we embed an approximation of the original matrix into a larger, but highly 
structured sparse one that allows fast factorization and application of the inverse. The algorithm 
extends the Martinsson/Rokhlin method developed for 2D boundary integral equations and proceeds 
in two phases: a precomputation phase, consisting of matrix compression and factorization, followed 
by a solution phase to apply the matrix inverse. For boundary integral equations which are not 
too oscillatory, e.g., based on the Green's functions for the Laplace or low-frequency Helmholtz 
equations, both phases typically have complexity 0{N) in two dimensions, where N is the number 
of discretization points. In our current implementation, the corresponding costs in three dimensions 
are 0(iV^/^) and 0(N log N) for precomputation and solution, respectively. Extensive numerical 
experiments show a speedup of ~ 100 for the solution phase over modern fast multipole methods; 
however, the cost of precomputation remains high. Thus, the solver is particularly suited to problems 
where large numbers of iterations would be required. Such is the case with ill-conditioned linear 
systems or when the same system is to be solved with multiple right-hand sides. Our algorithm is 
implemented in Fortran and freely available. 
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1. Introduction. Many problems in computational science and engineering re- 
quire the solution of large, dense linear systems. Standard direct methods based 
on Gaussian elimination, of course, require 0{N^) work, where N is the system size. 
This quickly becomes infeasible as N increases. As a result, such systems are typically 
solved iteratively, combining GMRES ^41], Bi-CGSTAB [4^ or some other iterative 
scheme with fast algorithms to apply the system matrix, when available. For the 
integral equations of classical physics, this combination has led to some of the fastest 
solvers known today, with dramatically lower complexity estimates of the order 0{N) 
or 0{N\ogN) [ni 133 EHl and references therein]. 

Despite their tremendous success, however, iterative methods still have several 
significant disadvantages when compared with their direct counterparts: 

(i) The number of iterations required by an iterative solver is highly sensitive 
to the conditioning of the system matrix. Ill-conditioning arises, for example, in the 
solution of problems near resonance (particularly in the high frequency regime), in 
geometries with "close-to-touching" interactions, in multi-component physics models 
with large contrasts in material properties, etc. Under these circumstances, the solu- 
tion time can be far greater than expected. Direct methods, by contrast, are robust 
in the sense that their solution time does not degrade with conditioning. Thus, they 
are often preferred in production environments, where reliability of the solver and 
predictability of the solution time are important. 
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(ii) One often wishes to solve a linear system governed by a fixed matrix with 
multiple right-hand sides. This occurs, for example, in scattering problems, in opti- 
mization, and in the modeling of time-dependent processes in fixed geometry. Most 
iterative methods are unable to effectively exploit the fact that the system matrix is 
the same, and simply treat each right-hand side as a new problem. Direct methods, 
on the other hand, are extremely efficient in this regard: once the system matrix has 
been factored, the matrix inverse can be applied to each right-hand side at a much 
lower cost. 

(iii) One often wishes to solve problems when the system matrix is altered by a 
low-rank modification. Standard iterative methods do a poor job of exploiting this 
fact. Direct methods, on the other hand, can update the factorization of the orig- 
inal matrix using the Sherman-Morrison- Woodbury formula |29j or use the existing 
factorization as a preconditioner. 

In this paper, we present an algorithm for the solution of structured linear sys- 
tems that overcomes these deficiencies, while remaining competitive with modern fast 
iterative solvers in many practical situations. The algorithm directly constructs a 
compressed ("data-sparse") representation of the system matrix inverse, assuming 
only that the matrix has a block low-rank structure similar to that utilized by fast 
matrix- vector product techniques like the fast multipole method (FMM) 22, ,23^. Such 
matrices typically arise from the discretization of integral equations, where the low- 
rank structure can be understood in terms of far-field interactions between clusters of 
points, but the procedure is general and makes no a priori assumptions about rank. 
Our scheme is a multilevel extension of the work described in [3T] , which itself is based 
on the fast direct multilevel method developed for 2D boundary integral equations by 
Martinsson and Rokhlin [55] . 

While we do not seek to review the literature on fast direct solvers here, it is 
worth noting that similar efforts have been (and continue to be) pursued by various 
groups, most notably in the context of hierarchically semiseparable (HSS) matrices 
[BJ [7| 03] and H-matrices [551 113 I2E] ■ A short historical discussion can be found in 
PT] as well as in the recent article by Gillman et al. ^T7]. The latter paper makes 
several improvements on the algorithm of |35j . and presents a simple framework for 
understanding, implementing, and analyzing schemes for inverting integral equations 
on curves (that is, domains parametrized by a single variable). Planar domains with 
corners were treated recently in Applications to electromagnetic wave problems 
were considered in [35] U?]- Finally, it should be noted that Gillman's dissertation 
[TB] includes 3D experiments that also extend the Martinsson-Rokhlin formalism to 
the case of integral equations on surfaces. 

The present paper provides a mix of analysis, algorithmic work, and applications. 
The novelty of our contribution lies: 

(i) in the use of compression and auxilliary variables to embed an approximation 
of the original dense matrix into a sparse matrix framework that can make use of 
standard and well-developed sparse matrix technology; 

(ii) in providing detailed numerical experiments in both 2D and 3D; and 

(iii) in demonstrating the utility of fast direct solvers in several applications. 
We believe that the scheme is substantially simpler to implement than prior schemes 
and that it leads to a more stable solution process. 

As in previous schemes (see, e.g., |17)). the core algorithm in our work com- 
putes a compressed matrix representation using the interpolative decomposition (ID) 
[101 1321 148| via a multilevel procedure that we refer to as recursive skeletonization. 
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Once obtained, the compressed representation serves as a platform for fast matrix 
algebra including matrix-vector multiplication and matrix inversion. In its former 
capacity, the algorithm may be viewed as a generalized or kernel-independent FMM 
[191 1361 150j : we explore this application in fj6j For matrix inversion, we show how 
to embed the compressed representation in an equivalent (but larger) sparse system, 
much in the style of O [39] . We then use a state-of-the-art sparse matrix solver to do 
the rest. We are grateful to David Bindel for initially suggesting an investigation of 
the sparse matrix formalism and rely in this paper on the sparse direct solver software 
UMFPACK [T^[T3]. As in dense LU factorization, the direct solver is a two-phase pro- 
cess. First, following the generation of the compressed matrix embedding, a factored 
representation of the inverse is constructed. Second, in the solution phase, the matrix 
inverse is applied in a rapid manner to a specified right-hand side. As expected, the 
solution phase is very inexpensive, often beating a single FMM call by several orders 
of magnitude. For boundary integral equations without highly oscillatory kernels, 
e.g., the Green's function for the Laplace or low-frequency Helmholtz equation, both 
phases typically have complexity 0{N) in 2D. In 3D, the complexities in our current 
implementation are 0{N^^^) and 0{N log N) for precomputation (compression and 
factorization) and solution, respectively. 

The remainder of this paper is organized as follows. In ^ we define the ma- 
trix structure of interest and review certain aspects of the ID. In f|3j we review the 
recursive skeletonization algorithm for matrix compression and describe the new for- 
malism for embedding the compressed matrix in a sparse format. In ^|4] we study the 
complexity of the algorithm for non-oscillatory problems, while in ^|5] we give error 
estimates for applying a compressed matrix and its inverse. In f|6] we demonstrate the 
efficiency and generality of our scheme by reporting numerical results from its use as a 
generalized FMM, as a direct solver, and as an accelerator for molecular electrostatics 
and scattering problems. Finally, in f|7j we summarize our findings and discuss future 
work. 

2. Preliminaries. In this section, we discuss the precise matrix structure that 
makes our fast solver possible. For this, let A e ([^nxn matrix whose index 

vector J = (l,2,...,iV) is grouped into p contiguous blocks of rii elements each, 
where J2i=i rii ^ N: 

i~l i \ 

^ + 1, ^ nj + 2, . . . , ^ nJ , i = l,...,p. 

Then the linear system ylx — b can be written in the form 

p 

A^jXj — hi, i — 1, . . . ,p, 

where Xi,b,; e C"' and Aij £ C"'^"^. Solution of the full linear system by classical 
Gaussian elimination is well-known to require 0{N^) work. 

Definition 2.1 (block separability). The matrix A is said to be block separable 
if each off-diagonal submatrix Aij can be decomposed as the product of three low-rank 
matrices: 



(2.1) 



Aij — LiSijRj, i j, 
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where Li e C 



riiXk' 



Sij G 



id R-j G 



ith kjjkf <^ Tii. Note that in 



(2.1), the left matrix Li depends only on the index i and the right matrix Rj depends 
only on the index j . 

We will see how such a factorization arises below. The term block separable was 
introduced in [17] , and is closely related to that of semiseparable matrices [6l [3 [49] 
and "H-matrices [261 123 EH] . In [21] , the term structured was used, but block separable 
is somewhat more informative. 

Definition 2.2 (off-diagonal block rows and columns). The ith off-diagonal 



Aip ] consisting of the 
deleted; the off-diagonal block columns 



block row of A is the submatrix [An 
ith block row of A with the diagonal block A 
of A are defined analogously. 

Clearly, the block separability condition (2.1 1 is equivalent to requiring that the 
zth off-diagonal block row and column have rank fc^ and fej^, respectively, for i = 
1, . . . ,p (see Sjsjfor details). 

When A is block separable, it can be written as 

(2.2) A = D + LSR, 

where 

" An 

D = 

Ar. 



^NxN 



is block diagonal, consisting of the diagonal blocks of A, 



L 



Li 



Lv 



R 



Ri 



Rn 



are block diagonal, where 



S = 



ELi k\ and = 
■ ■ • Sip 

S2I • • • S2p 



ELi and 







■^K.xK^ 



Spl Op2 

is dense with zero diagonal blocks. It is convenient to let z 
can then write the original system in the form 



i?x and y = S2,. We 



(2.3) 



D 
R 



L 



-I 
S 





X 




" b ' 




y 









z 








This system is highly structured and sparse, and can be efficiently factored using 
standard techniques. If we assume that each block corresponds to Ni = N /p unknowns 
and that the ranks k\ = k^ = k oi the off-diagonal blocks are all the same, it is 
straightforward to see [T71 [5T] that a scheme based on (2.2) or ( |2.3[ ) requires an 
amount of work of the order 0{p{N /pY +p^k^). 

In many contexts (including integral equations), the notion of block separability 
is applicable on a hierarchy of subdivisions of the index vector. That is to say, a 



decomposition of the form (2.2 1 can be constructed at each level of the hierarchy. 
When a matrix has this structure, much more powerful solvers can be developed, but 
they will require some additional ideas (and notation). 
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Level 1: blocks 



Fig. 2.1. An example of a tree structure imposed on the index vector (1,2, . . . ,N). At each 
level of the hierarchy, a contiguous block of indices is divided into a set of children, each of which 
corresponds to a contiguous subset of indices. 



column 




Fig. 2.2. Matrix rank structure. At each level of the index tree, the off-diagonal block rows 
and columns (black) must have low numerical rank; the diagonal blocks (white) can in general be 
full-rank. 



2.1. Hierarchically structured matrices. Our treatment in this section fol- 
lows that of [TT]. Let J = (1, 2, ... , N) be the index vector of a matrix A e C^""^. 
We assume that a tree structure r is imposed on J which is A + 1 levels deep. At 
level I, we assume that there are pi nodes, with each such node J^'''^ corresponding to 
a contiguous subsequence of J such that 



J. 



We denote the finest level as level 1 and the coarsest level as level A + 1 (which consists 
of a single block). Each node ji'^ at level I > 1 has a finite number of children at 
level I 



1 whose concatenation yields the indices in jj^^^ (Fig- 2.1). 
The matrix A is hierarchically block separable [TT] if it is block separable at each 
level of the hierarchy defined by r. In other words, it is structured in the sense of the 
present paper if, on each level of r, the off-diagonal block rows and columns are low- 



rank (Fig. 2.2). Such matrices arise, for example, when discretizing integral equations 
with non-oscillatory kernels (up to a specified precision). 
Example 1. Consider the integral operator 



(2.4) 

where 

(2.5) 



0(x) = / G{x,y)p{y)dy 



G{x,y) 



27r 



log \x - y\ 



is the Green's function for the 2D Laplace equation, and the domain of integration is 
a square B in the plane. This is a 2D volume integral operator. Suppose now that we 
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discretize (2.4) on a V N x VN grid: 



(2.6) (j){x^) ^ ^'^G{x^,Xj)p{Xj). 

(This is not a high-order quadrature but that is really a separate issue.) Let us 
superimpose on i? a quadtree of depth A + 1, where B is the root node (level A + 1). 
Level A is obtained from level A + 1 by subdividing the box B into four equal squares 
and reordering the points Xi so that each child holds a contiguous set of indices. This 
procedure is carried out until level 1 is reached, reordering the nodes at each level 
so that the points contained in every node at every level correspond to a contiguous 



set of indices. It is clear that, with this ordering, the matrix corresponding to (2.61 
is hierarchically block separable, since the interactions between nonadjacent boxes at 
every level are low-rank to any specified precision (from standard multipole estimates 
[22]). Adjacent boxes are low-rank for a more subtle reason (see Sjljand Fig. 4.1 1 



Example 2. Suppose now that we wish to solve an interior Dirichlet problem for 
the Laplace equation in a simply connected 3D domain fl with boundary dfl: 

(2.7) Am = infl, u = f on dfl. 

Potential theory [24] suggests that we seek a solution in the form of a double-layer 
potential 

f dG 

(2.8) u{x) ^ — {x, y) a (y) dy for xGVL, 

Jan ovy 

where 

(2.9) G{x,y) = -^ 

47r \x — y\ 

is the Green's function for the 3D Laplace equation, Vy is the unit outer normal at 
y e 951, and a is an unknown surface density. Letting x approach the boundary, this 
gives rise to the second-kind Fredholm equation 

(2.10) -\a{x)+ j ^ (x, y) a (y) dy = f (x) . 

Using a standard collocation scheme based on piecewise constant densities over a 
triangulated surface, we enclose dVt in a box B and bin sort the triangle centroids 
using an octree where, as in the previous example, we reorder the nodes so that each 
box in the hierarchy contains contiguous indices. It can be shown that the resulting 
matrix is also hierarchically block separable (see ^and [H]). 

We turn now to a discussion of the ID, the compression algorithm that we will use 
to compute low-rank approximations of off-diagonal blocks. A useful feature of the 
ID is that it is able to compute the rank of a matrix on the fly, since the exact ranks 
of the blocks are difficult to ascertain a priori — that is to say, the ID is rank-revealing. 

2.2. Interpolative decomposition. Many decompositions exist for low-rank 
matrix approximation, including the singular value decomposition, which is well- 
known to be optimal [20]. Here, we consider instead the ID [1011321148], which produces 
a near-optimal representation that is more useful for our purposes as it permits an 
efficient scheme for multilevel compression when used in a hierarchical setting. 

Definition 2.3 (interpolative decomposition). Let A e C™^" he a matrix, and 
II • II the matrix 2-norm. A rank-k approximation of A in the form of an interpolative 
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Fig. 3.1. One Zei;eZ of matrix compression, obtained by sequentially compressing the off-diagonal 
block rows and columns. At each step, the matrix blocks whose row or column spaces are being 
compressed are highlighted in white. 



decomposition (ID) is a representation A sa BP, where B G c^x^^ whose columns 
constitute a subset of the columns of A, and P G C'^^", a subset of whose columns 
makes up the k x k identity matrix, such that \\P\\ is .small and \\A — BP\\ ^ (Jk+i, 
where (Tk+i is the (k + 1 )st greatest singular value of A. We call B and P the skeleton 
and projection matrices, respectively. 

Clearly, the ID compresses the column space of A; to compress the row space, 
simply apply the ID to A^ , which produces an analogous representation A — PB, 
where P G C™^'^ and 13 G C^^". 

Definition 2.4 (row and column skeletons). The row indices that corrrespond 
to the retained rows in the ID are called the row or incoming skeletons. The column 
indices that corrrespond to the retained columns in the ID are called the column or 
outgoing skeletons. 

Reasonably efficient schemes for constructing an ID exist (TUl |3H EH] • By com- 
bining such schemes with methods for estimating the approximation error, we can 
compute an ID to any relative precision e > by adaptively determining the required 
rank k [35]. This is the sense in which we will use the ID. 

While previous related work pTl[55] used the deterministic 0{kmn) algorithm of 
[10) . we employ here the latest compression technology based on random sampling, 
which typically requires only 0{mnlogk + k'^n) operations [32 ] I48 j . 



3. Algorithm. In this section, we first describe the "standard" ID-based fast 
multilevel matrix compression algorithm (as in |171 135] 1. The HSS and H-matrix 
formalisms use the same underlying philosophy [51 [T] [551 113 [2H1 HH] ■ We then describe 
our new inversion scheme. 

3.1. Hierarchical matrix compression. Let A G C^^^ be a matrix with 
pxp blocks, structured in the sense of |2.1[ and e > a target relative precision. We 
first outline a one-level matrix compression scheme: 

1. For J = 1, . . . use the ID to compress the row space of the ith off-diagonal 
block row to precision e. Let Li denote the corresponding row projection matrix. 

2. Similarly, for j — 1, . . . ,p, use the ID to compress the column space of the 
jth off-diagonal block column to precision e. Let Rj denote the corresponding column 
projection matrix. 

3. Approximate the off-diagonal blocks of A by Aij LiSijRj for i ^ j , where 
Sij is the submatrix of Aij defined by the row and column skeletons associated with 
Li and Rj, respectively. 

This yields precisely the matrix structure discussed in f|2j following (2.1 1. The 



one-level scheme is illustrated graphically in Fig. 3.1 

The multilevel algorithm is now just a simple extension based on the observation 
that by ascending one level in the index tree and regrouping blocks accordingly, we 
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Fig. 3.2. Multilevel matrix compression, comprising alternating steps of compression and re- 
grouping via ascension of the index tree. The diagonal blocks (white and gray) are not compressed, 
but are instead extracted at each level of the tree; they are shown here only to illustrate the regrouping 
process. 



can compress the skeleton matrix S in (2.2 1 in exactly the same form, leading to a 



procedure that we naturally call recursive .skeletonization (Fig. |3.2[ ). 
The full algorithm may be specified as follows: 

1. Starting at the leaves of the tree, extract the diagonal blocks and perform 
one level of compression of the off-diagonal blocks. 

2. Move up one level in the tree and regroup the matrix blocks according to the 
tree structure. Terminate if the new level is the root; otherwise, extract the diagonal 
blocks, recompress the off-diagonal blocks, and repeat. 

The result is a telescoping representation 

(3.1) A « + id) ^ ^(2) . . ^(A) ^ ^(A)^^(A) . . ^(2)j ^(1)^ 

where the superscript indexes the compression level I — I, . . . , X. 

Example 3. As a demonstration of the multilevel compression technique, consider 
the matrix defined by A'^ = 8192 points uniformly distributed in the unit square, inter- 



acting via the 2D Laplace Green's function (2.5) and sorted according to a quadtree 
ordering. The sequence of skeletons remaining after each level of compression to 
e = 10^'^ is shown in Fig. 3.3 from which we see that compression creates a spar- 
sification of the sources which, in a geometric setting, leaves skeletons along the 
boundaries of each block. 

3.2. Accelerated compression via proxy surfaces. The computational cost 
of the algorithm described in the previous section is dominated by the fact that each 
step is global: that is, compressing the row or column space for each block requires 
accessing all other blocks in that row or column. If no further knowledge of the matrix 
is available, this is indeed necessary. However, as noted by [TOl [211 |35j [37] , this global 
work can often be replaced by a local one, resulting in considerable savings. 

A sufhcient condition for this acceleration is that the matrix correspond to eval- 
uating a potential field for which some form of Green's identities hold. It is easiest 
to present the main ideas in the context of Laplace's equation. For this, consider 
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No = 8192 Ni = 7134 N2 = 4138 




N3 = 1849 N4, = 776 Ns = 265 




Fig. 3.3. Sparsification by recursive skeletonization. Logarithmic interactions between N = 
8192 points in the unit square are compressed to relative precision e = 10~^ using a five-level 
quadtree-based scheme. At each level, the surviving skeletons are shown, colored by block index, with 
the total number of skeletons remaining given by Ni for compression level I = 0, ... ,5, where I = 
denotes the original uncompressed system. 

Fig. |3.4[ which depicts a set of sources in the plane. We assume that block index i 
corresponds to the sources in the central square B. The ith off-diagonal block row 
then corresponds to the interactions of all points outside B with all points inside 
B. We can separate this into contributions from the near neighbors of B, which are 
local, and the distant sources, which lie outside the near-neighbor domain, whose 
boundary is denoted by T. But any field induced by the distant sources induces a 
harmonic function inside T and can therefore be replicated by a charge density on F 
itself. Thus, rather than using the detailed structure of the distant points, the row 
(incoming) skeletons for B can be extracted by considering just the combination of 
the near-neighbor sources and an artifical set of charges placed on F, which we refer to 
as a proxy surface. Likewise, the column (outgoing) skeletons for B can be determined 
by considering only the near neighbors and the proxy surface. If the potential field is 
correct on the proxy surface, it will be correct at all more distant points (again via 
some variant of Green's theorem). 

The interaction rank between F and B is constant (depending on the desired 
precision) from standard multipole estimates |22l [23] . In summary, the number of 
points required to discretize F is constant, and the dimension of the matrix to compress 
against for the block corresponding to B is essentially just the number of points in 
the physically neighboring blocks. 

Similar arguments hold for other kernels of potential theory including the heat, 
Helmholtz, Yukawa, Stokes, and elasticity kernels, though care must be taken for 
oscillatory problems which could require a combination of single and double layer 
potentials to avoid spurious resonances in the representation for the exterior. 



3.3. Compressed matrix-vector multiplication. The compressed represen- 
tation (3.1 ) admits an obvious fast algorithm for computing the matrix- vector product 
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rated 



Fig. 3.4. Accelerated compression using proxy surfaces. The field within a region B due to a 
distribution of exterior sources (left) can be decomposed into neighboring and well-separated contri- 
butions. By representing the latter via a proxy surface F ( center), the matrix dimension to compress 
against for the block corresponding to B (right) can be reduced to the number of neighboring points 
plus a constant set of points on F, regardless of how many points lie beyond F. 



y = Ayi. As shown in [T7], one simply applies the matrices in (3.1 ) from right to left. 
Like the FMM, this procedure can be thought of as occurring in two passes: 

1. An upward pass, corresponding to the sequential application of the column 
projection matrices i?^'-*, which hierarchically compress the input data x to the column 
(outgoing) skeleton subspace. 

2. A downward pass, corresponding to the sequential application of the row 
projection matrices L^^\ which hierarchically project onto the row (incoming) skeleton 
subspace and, from there, back onto the output elements y. 



3.4. Compressed matrix inversion. The representation (3.1) also permits a 
fast algorithm for the direct inversion of nonsingular matrices. The one-level scheme 
was discussed in ^j2] In the multilevel scheme, the system Sz = y in (2.3) is itself 
expanded in the same form, leading to the sparse embedding 



(3.2) 



-/ 



i(2) 



-I 







X 




" b ' 
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To understand the consequences of this sparse representation, it is instructive 
to consider the special case in which the row and column skeleton dimensions are 
identical for each block, say k, so that the total row and column skeleton dimensions 
are K = = — pk. Then, studying (2.3 ) first and assuming that D is invertible, 
block elimination of x and y yields 

(A + S*) z = Ai^D-^b, 

where A = {RD~^ L)~^ £ ^kxk block diagonal. Back substitution then yields 

X = \d-^ - D^^LARD^^ + D^^LA (A + Sy^ ARD^^] b. 



In other words, the matrix inverse is 



(3.3) 
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where 
and 

are all block diagonal, and 

S = A + S eC^""^ 

is dense. Note that S is equal to the skeleton matrix S with its diagonal blocks filled 



in. Thus, (3.3) is a compressed representation of A ^ with minimal fill-in over the 



original compressed representation ( 2.2 1 of A. In the multilevel setting, one carries out 



the above factorization recursively, since S can now be inverted in the same manner: 

(3.4) « 2?(1) + £(1) + £(2) . . p(A) ^ £(A)5-l7^(A) . . ^(2)j ^(1)^ 

This point of view is elaborated in [T7]. 

In the general case, the preceding procedure may fail if D happens to be singular 
and (more generally) may be numerically unstable if care is not taken to stabilize the 
block elimination scheme using some sort of pivoting. Thus, rather than using the 
"hand-rolled" Gaussian elimination scheme of [HI |T71 |3S] to compute the telescoping 



inverse (3.4), we rely instead on the existence of high-quality sparse direct solver 



software. More precisely, we simply supply UMFPACK with the sparse representation 



(3.2) and let it compute the corresponding factorization. Numerical results show that 



the performance is similar to that expected from (3.4 1 



4. Complexity analysis. For the sake of completeness, we briefly analyze the 
complexity of the algorithm presented in f|3]for a typical example: discretization of the 



integral operator (2.4), where the integral kernel has smoothness properties similar 
to that of the Green's function for the Laplace equation. We follow the analysis of 
[TBI [T71 155] and estimate costs for the "hand-rolled" Gaussian elimination scheme. 
We ignore quadrature issues and assume that we are given a matrix A acting on N 
points distributed randomly in a d-dimensional domain, sorted by an orthtree that 
uniformly subdivides until all block sizes are 0(1). (In ID, an orthtree is a binary 
tree; in 2D, it is a quadtree; and in 3D, it is an octree.) 

For each compression level I = 1, . . . , A, with I = 1 being the finest, let pi be 
the number of matrix blocks, and ni and fc; the uncompressed and compressed block 
dimensions, respectively, assumed equal for all blocks and identical across rows and 
columns, for simplicity. We first make the following observations: 

(i) The total matrix dimension is pini — N, where ni = 0(1), so pi ~ N. 

(ii) Each subdivision increases the number of blocks by a factor of roughly 2"^, 
so pi ~ pi-i/2'' ~ pi/2''('-i\ In particular, px = 0(1), so A ~ log^a N = (1/d) log A^. 

(iii) The total number of points at level / > 1 is equal to the total number of 
skeletons at level I — 1, i.e., pini = pi-iki^i, so ni ^ 

Furthermore, we note that ki is on the order of the interaction rank between 
two adjacent blocks at level I, which can be analyzed by recursive subdivision of 
the source block to expose well-separated structures with respect to the target (Fig. 



4.1). Assuming only that the interaction between a source subregion separated from 



12 



K. L. HO AND L. GREENGARD 




Fig. 4.1. The interaction rank between two adjacent blocks can be calculated by recursively 
subdividing the source block (white) into well-separated subblocks with respect to the target (gray), 
each of which have constant rank. 



a target by a distance of at least its own size is of constant rank (to a fixed precision 
e), we have 



log„(j ni 



logn/ if d = 1, 
nl-'/'' ifd>l, 



1=1 

where, clearly, ni ^ {pi/pi)ni ^ 2^('~^'ni, so 

r (/-l)log2 + logni ifd=l, 
'''^1 2('^-i)('-i)n}-i/'^ ifd> 1. 

4.1. Matrix compression. From |2.2[ the cost of computing a rank-fc ID of 
an m X n matrix is 0(m?i log fc + k'^n). We will only consider the case of proxy 
compression, for which m — 0{ni) for a block at level Z, so the total cost is 

^ ' N if d= 1, 



(4.1) Ten, log ^' + ^''"0 ^ I 

1=1 ^ 



^3(l-l/d) ifrf>l. 



4.2. Matrix-vector multiplication. The cost of applying I?^'^ is 0{pinf), 
while that of applying i^'^ or i?^') is 0{pikini). Combined with the 0{{p\k\)^) cost 
of applying 5, the total cost is 

A ( N ifd==l, 

(4.2) T,^,^ypini{ki+ni) + {pxkxf^\N\ogN if d = 2, 

1=1 [ iV2(i-i/d) ifd>2. 

4.3. Matrix factorization and inverse application. We turn now to the 
analysis of the cost of factorization using (3.4). At each level /, the cost of constructing 
D^^ and A is 0{pinf), after which forming I?'^'^ /^'^'^ and 7?.*^'^ all require 0{pinf) 
operations; at the final level, the cost of constructing and inverting S is 0{{pxk\)^). 
Thus, the total cost is 



Tiu ^'^Pinf + {p\k. 



3 

X) 



1=1 



which has complexity (4.1). 

Finally, we note that the dimensions of I?^'), 7l^'-\ and S^^ are the same 
as those of D^'', L^'^ R^''\ and 5, respectively. Thus, the total cost of applying the 
inverse, denoted by Tsv, has the same complexity as Tmv, namely (4.2). 
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In our UMFPACK-based approach, the estimation of cost is a rather comphcated 
task, and we do not attempt to carry out a detailed analysis of its performance. Suffice 
it to say, there is a one-to-one correspondence between the "hand-rolled" Gaussian 
elimination approach and one possible elimination scheme in UMFPACK. Since that 
solver is highly optimized, the asymptotic cost should be the same (or better). For 
some matrices, it is possible that straight Gaussian elimination may be unstable with- 
out pivoting, while UMFPACK will carry out a backward-stable scheme. This is a 
distinct advantage of the sparse matrix approach although the complexity and fill-in 
analysis then becomes more involved. 

4.4. Storage. An important issue in direct solvers, of course, is that of storage 
requirements. In the present setting the relevant matrices are the compressed sparse 



representation (3.2) and the factorization computed within UMFPACK. This will 
be (4.2 1 for the forward operator and, in the absence of pivoting, for the sparse 
factorization as well. If pivoting is required, the analysis is more complex as it involves 
some matrix fill-in and is postponed to future work. 

5. Error analysis. We now state some simple error estimates for applying and 
inverting a compressed matrix. Let A be the original matrix and A^^ its compressed 
representation, constructed using the algorithm of ^ such that 

\\A^AA\ ^ 
< e 



\\A\\ 

for some e > 0. Note that this need not be the same as the specified local precision in 
the ID since errors may accumulate across levels. However, as in |17j . we have found 
that such error propagation is mild. 

Let X and b be vectors such that Ax ~ b. Then it is straightforward to verify 
that for be = A^x, 

where k{A) is the condition number of A. Furthermore, if = A~^h, then 

||x-xe|| ^ 2eK (A) 
||x|| - l-eK{A)' 

In particular, if A is well-conditioned, e.g., A is the discretization of a second-kind 
integral equation, then k(A) = 0(1), so 



6. Numerical examples. In this section, we investigate the efficiency and flex- 
ibility of our algorithm by considering some representative examples. We begin with 
timing benchmarks for the Laplace and Helmholtz kernels in 2D and 3D, using the al- 
gorithm both as an FMM and as a direct solver, followed by applications in molecular 
electrostatics and multiple scattering. 

All matrices were blocked using quadtrees in 2D and octrees in 3D, uniformly 
subdivided until all block sizes were C(l), but adaptively truncating empty boxes 
during the refinement process. Only proxy compression was considered, with proxy 
surfaces constructed on the boundary of the supercell enclosing the neighbors of each 
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Table 6.1 

Numerical results for applying the Laplace kernel in the 2D surface case at precision e = 10~^; 
A'', uncompressed matrix dimension; Kr, row skeleton dimension; Kc, column skeleton dimension; 
Tcm, matrix compression time (s); Tmv, matrix-vector multiplication time (s); E, relative error; M, 
required storage for compressed matrix (MB). 



N 




Kc 








E 




M 


1024 


94 


94 


6.7E-2 


I.Oe- 


3 


3.1e- 


8 


8.5E-1 


2048 


105 


104 


1.4E-1 


I.Oe- 


3 


4.5e- 


8 


1.7E+0 


4096 


113 


114 


3.1E-1 


I.Oe- 


3 


I.Ie- 


7 


3.4E+0 


8192 


123 


123 


6.7E-1 


3.0e- 


3 


4.4e- 


7 


6.4E+0 


16384 


133 


134 


1.4E+0 


7.0e- 


3 


4.0e- 


7 


1.3E+1 


32768 


142 


142 


2.7E+0 


1.4e- 


2 


4.7e- 


7 


2.5e+1 


65536 


150 


149 


5.4E+0 


2.8e- 


2 


9.4e- 


7 


5.0E+1 


131072 


159 


158 


l.lE + 1 


5.7e- 


2 


9.8e- 


7 


I.Oe+2 



block. We discretized all proxy surfaces using a constant number of points, indepen- 
dent of the matrix size N: for the Laplace equation, this constant depended only 
on the compression precision e, while for the Helmholtz equation, it depended also 
on the wave frequency, chosen to be consistent with the Nyquist-Shannon sampling 
theorem. Computations were performed over M instead of C, where possible. The 
algorithm was implemented in Fortran, and all experiments were performed on a 2.66 
GHz processor in double precision. 

In many instances, we compare our results against those obtained using LA- 
PACK/ ATLAS dmS] and the FMM [H |22l EH |40| . All FMM timings were computed 
using the open-source FMMLIB package [15], which is a fairly efficient implemen- 
tation but does not include the plane-wave optimizations of [5J or the diagonal 
translation operators of [40j. 

6.1. Generalized fast multipole method. We first consider the use of recur- 
sive skeletonization as a generalized FMM for the rapid computation of matrix- vector 
products. 

6.1.1. The Laplace equation. We considered two point distributions in the 
plane: points on the unit circle and in the unit square, hereafter referred to as the 
2D surface and volume cases, respectively. We assumed that the governing matrix 



corresponds to the interaction of charges via the Green's function (2.5 1. The surface 
case is typical of layer-potential evaluation when using boundary integral equations. 
Since a domain boundary in 2D can be described by a single parameter (such as 
arclength), it is a ID domain, so the expected complexities from [|4] correspond to d = 
1: 0{N) work for both matrix compression and application. (See [T7] for a detailed 
discussion of the d = 1 case.) In the volume case, the dimension is d = 2, so the 
expected complexities are 0{N'^^^) and ©(TV log A^) for compression and application, 
respectively. 



For the 3D Laplace kernel (2.9), we considered surface and volume point geome- 
tries on the unit sphere and within the unit cube, respectively. The corresponding 
dimensions are d = 2 and d = 3. Thus, the expected complexities for the 3D surface 
case are 0{N^/^) and 0{N log N) for compression and application, respectively, while 
those for the 3D volume case are 0{N^) and 0{N^/^), respectively 

We present timing results for each case and compare with LAPACK/ ATLAS 
and the FMM for a range of A^ at e = 10~^. Detailed data are provided in Tables 
|6.1f|6.4| and plotted in Fig. |6.1| It is evident that our algorithm scales as predicted. 
Its performance in 2D is particularly strong. Not only does our algorithm beat the 
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2D surface 2D volume 




10^ 10'' 10^ 10^ 10-* 

N 



o— o LP pc □— □ LP mv FMM •— • RS pc ■— ■ RS mv 



Fig. 6.1. CPU times for applying the Laplace kernel in various cases using LAPACK/ATLAS 
(LP), the FMM, and recursive skeletonization (RS) as a function of the matrix size N . For LP 
and RS, the computation is split into two parts: precomputation (pc), for LP consisting of matrix 
formation and for RS of matrix compression, and matrix-vector multiplication (mv). The precision 
of the FMM and RS was set at e = 10~^. Dotted lines indicate extrapolated data. 

Table 6.2 

Numerical results for applying the Laplace kernel in the 2D volume case at precision e = 10~^; 
notation as in Table\6J\ 



N 


K, 




T 

^ cm 


T 

^ mv 




E 




M 


1024 


299 


298 


3.3E-1 


I.Oe- 


3 


3.6e- 


10 


2.9e+0 


2048 


403 


405 


8.9E-1 


I.Oe- 


3 


3.7e- 


10 


7.1E+0 


4096 


570 


570 


2.7E+0 


5.0e- 


3 


I.Oe- 


09 


1.8E+1 


8192 


795 


793 


6.8E+0 


I.Oe- 


2 


8.8e- 


10 


4.3E+1 


16384 


1092 


1091 


1.8E+1 


2.3e- 


2 


7.7e- 


10 


I.Oe+2 


32768 


1506 


1505 


4.4E+1 


4.5e- 


2 


I.Oe- 


09 


2.3e+2 


65536 


2099 


2101 


1.3E+2 


I.Ie- 


1 


I.Ie- 


09 


5.3E+2 


131072 


2904 


2903 


3.4E+2 


2.7e- 


1 


I.Ie- 


09 


1.2E+3 



0{N'^) uncompressed matrix-vector product for modest N, it is faster even than the 
0{N) FMM (at least after compression). In 3D, the same is true over the range of 
N tested, although the increase in asymptotic complexity would eventually make the 
scheme less competitive. In all cases studied, the compression time Tcm was larger 
than the time to apply the FMM by one (2D surface) to two (all other cases) orders 
of magnitude, while the compressed matrix-vector product time T^v was consistently 
smaller by the same amount. Thus, our algorithm also shows promise as a fast iterative 
solver for problems requiring more than ^ 10-100 iterations. Furthermore, we note 
the effectiveness of compression: for N — 131072, the storage requirement for the 
uncompressed matrix is 137 GB, whereas that for the compressed representations are 
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Table 6.3 

Numerical results for applying the Laplace kernel in the 3D surface case at precision e = 10 
notation as in Table[6l\ 



N 


K, 


Kc 








E 




M 


1024 


967 


967 


5.2E-1 


I.Oe- 


3 


I.Oe- 


11 


7.7E+0 


2048 


1531 


1532 


1.4E+0 


4.0e- 


3 


1.8e- 


10 


2.2E+1 


4096 


2298 


2295 


6.1E+0 


I.Ie- 


2 


1.4e- 


10 


6.2E+1 


8192 


3438 


3426 


2.7E+1 


2.9e- 


2 


1.2e- 


10 


1.7E+2 


16384 


4962 


4950 


8.7E+1 


7.2e- 


2 


3.0e- 


10 


4.2E+2 


32768 


6974 


6987 


3.1E+2 


1.7e- 


1 


4.3e- 


10 


9.9e+2 


65536 


9899 


9925 


9.2E+2 


4.5e- 


1 


7.7e- 


10 


2.3e+3 



Table 6.4 

Numerical results for applying the Laplace kernel in the 3D volume case at precision e = 10 
notation as in Table lKT] 



N 














E 




M 


1024 




1024 


1024 


5.1E-1 


2.0e- 


3 


9.3e- 


16 


8.4E+0 


2048 




1969 


1969 


3.0E+0 


6.0e- 


3 


5.6e— 


12 


3.2E+1 


4096 




3285 


3287 


9.7E+0 


1.6e- 


2 


6.8e- 


11 


9.8E+1 


8192 




5360 


5362 


4.4e+1 


4.8e- 


2 


6.3e— 


11 


3.0E+2 


16384 




8703 


8707 


2.9E+2 


1.5e- 


1 


5.7e- 


11 


9.3E+2 


32768 




14015 


14013 


1.9E+3 


5.5e- 


1 


7.5e- 


11 


2.9E+3 



only 100 MB and 1.2 GB in the 2D surface and volume cases, respectively; at a lower 
precision of e = 10^'^, these become just 40 and 180 MB. Finally, to provide some 
intuition about the behavior of the algorithm as a function of precision, we report the 
following timings for the 2D volume case with N — 131072: for e = 10^"^, T^n-i = 41 
s and Tmv = 0.09 s; for e = 10-^ Tc^ = 161 s and T^v = 0.18 s; and for e = 10"^ 
Tcm = 339 s and T^v = 0.27 s. 

6.1.2. The Helmholtz equation. We next considered the 2D and 3D Helmholtz 

kernels 



(6.1) G{x,y)^'-H^,'Hk\x-y\) 
and 



ik\x-y\ 

(6.2) Gix,y)^— 

47r \x — y\ 

respectively, where i?Q^^ is the zeroth order Hankel function of the first kind and k 
is the wavenumber. We used the same representative geometries as for the Laplace 
equation. The size of each domain 51 in wavelengths was given by 

k 

u) — — diam(r2). 
27r 

Timing results against LAPACK/ATLAS and the FMM at low frequency (w = 10 
in 2D and a; = 5 in 3D) with e — 10^^ are shown in Fig. 6.2 In this regime, the 



performance is very similar to that for the Laplace equation, as both kernels are 
essentially non-oscillatory; detailed data are therefore omitted. However, as discussed 
in |35| . the compression efficiency deteriorates as w increases, due to the growing 
ranks of the matrix blocks. In the high-frequency regime, there is no asymptotic 
gain in efficiency. Still, numerical results suggest that the algorithm remains viable 
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2D surface 



2D volume 



10^ 

102 


|IIM| 1 1 1 


.o 'i 


lOi 
10° 






10-1 






10-2 






10-3 




1 1 





o— o LP pc □— □ LP mv 



FMM 



RS pc 



RS mv 



Fig. 6.2. CPU times for applying the Helmholtz kernel in various cases at low frequency 
(ui = 10 in 2D and ui = 5 in 3D) using LAPACK/ ATLAS, the FMM, and recursive skeletonization 
at precision e = IQ- ^/ notation as in Fig. \6.l\ 



up to '-^ 200 in 2D and w ~ 10 in 3D. In all cases, the CPU times and storage 
requirements are larger than those for the Laplace equation by a factor of about two 
since all computations are performed over C instead of M; in 2D, there is also the 
additional expense of computing H^p. 

6.2. Recursive skeletonization as a direct solver. In this section, we study 
the behavior of our algorithm as a fast direct solver. More specifically, we considered 
the interior Dirichlct problem for the Laplace and Helmholtz equations in 2D and 
3D, recast as a second-kind boundary integral equation using the double-layer repre- 
sentation (2.8 1. Contour integrals in 2D were discretized using the trapezoidal rule. 



while surface integrals in 3D were discretized using Gaussian quadrature on flat trian- 
gles. In each case, we took as boundary data the field generated by an exterior point 
source; the error was assessed by comparing the field evaluated using the numerical 
solution via (2.8) against the exact field due to that source at an interior location. As 



a benchmark, we also solved each system directly using LAPACK/ATLAS, as well as 
iteratively using GMRES with matrix-vector products accelerated by the FMM. 



6.2.1. The Laplace equation. For the Laplace equation (2.7), the Green's 



function G in (2.10) is given by (2.5) in 2D and (2.9 1 in 3D. As a model geometry. 



we considered an ellipse with aspect ratio a = 2 (semi-major and -minor axes a — 2 
and & = 1, respectively) in 2D and the unit sphere in 3D; these boundaries have 
dimensions d = 1 and d = 2, respectively. Timing results are shown in Fig. |6.3[ with 
detailed data given in Tables 6.5 and 6.6 the precision was set to e = 10^^ in 2D and 
e = 10-6 in 3D. 
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Fig. 6.3. CPU times for solving Laplace's equation in various cases using LAPACK/ATLAS 
(LP), FMM/GMRES (FMM), and recursive skeletonization (RS) as a function of the system size 
N . For LP and RS, the computation is split into two parts: precomputation (pc), for LP consisting 
of matrix formation and factorization, and for RS of matrix compression and factorization; and 
system solution (sv), consisting of matrix inverse application. The precision of the FMM and RS 
was set at e = 10~^ in 2D and e = 10~® in 3D. Dotted lines indicate extrapolated data. 

Table 6.5 

Numerical results for solving Laplace's equation in 2D at precision t = 10~^; A*", uncompressed 
matrix dimension; K^, row skeleton dimension; Kc, column skeleton dimension; Tcm, matrix com- 
pression time (s); Ti^, sparse matrix factorization time (s); Tgv, inverse application time (s); E, 
relative error; M , required storage for compressed matrix inverse (MB). 



N 


K, 


Kc 


T 

^ cm 


Tiu 




T 

^ sv 




E 




M 


1024 


30 


30 


3.4E-2 


2.5e- 


2 


I.Oe- 


3 


9.0e- 


11 


1.6E+0 


2048 


29 


30 


7.0E-2 


5.1e- 


2 


2.0e- 


-3 


9.0e- 


12 


3.3e+0 


4096 


30 


30 


1.4E-1 


9.8e- 


2 


2.0e- 


3 


8.3e- 


11 


6.8E+0 


8192 


30 


31 


3.0E-1 


2.1e- 


1 


4.0e- 


3 


1.6e- 


10 


1.4E+1 


16384 


31 


31 


5.5E-1 


4.5e- 


1 


9.0e- 


-3 


5.5e- 


10 


2.8E+1 


32768 


30 


30 


l.lE+0 


8.5e- 


1 


1.9e- 


-2 


4.9e- 


12 


5.6E+1 


65536 


30 


30 


2.3E+0 


1.8e4 


-0 


3.8e- 


2 


I.Ie- 


11 


l.lE+2 


131072 


29 


29 


4.6E+0 


3.7e4 


-0 


7.5e- 


2 


8.5e- 


11 


2.2E+2 



In 2D, the solver has linear complexity and is exceptionally fast, handily beat- 
ing the 0{N^) uncompressed direct solver, but also coming very close to the 0{N) 
FMM/GMRES iterative solver. At = 131072, for example, the total solution 
time for the recursive skeletonization algorithm was Tpg = 8.5 s, while that for 
FMM/GMRES was Tfmm = 6.9 s using npMM = 7 iterations. It is worth em- 
phasizing, however, that our solver is direct and possesses obvious advantages over 
FMM/GMRES, as described in fjl] in particular, the algorithm is relatively insensi- 
tive to geometric ill-conditioning. Indeed, the direct solver edged out FMM/GMRES 
even at modest aspect ratios (for N = 8192 at e = 10^^^ with a = 8: Trs = 0.76 
s, TpMM — 0-98 s, ripMM — 15); for larger a, the effect was even more pronounced 
(a = 512: Trs — 2.5 s, Tfmm = 3.9 s, tifmm = 44). Furthermore, the compressed 
inverse representation allows subsequent solves to be performed extremely rapidly; for 
instance, at = 131072, the solve time was just Tgv — 0.07 s, i.e., Tfmm/Tsv ^ 100. 
Thus, our algorithm is especially efficient in regimes where Tgv dominates (see, e.g., 
[34j ) . Finally, we remark that although direct methods are traditionally very memory- 
intensive, our algorithm appears quite manageable in this regard: at A^ = 131072, the 
storage required for the compressed inverse was only 106 MB for e = lO^'^, 172 MB 
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Table 6.6 

Numerical results for solving Laplace's equation in 3D at precision e = 10~^; notation as in 
Table\KM 



N 














E 




M 


720 


628 


669 


1.3E+0 


l.lE-1 


I.Oe- 


3 


9.8e- 


5 


4.6E+0 


1280 


890 


913 


4.5E+0 


4.0E-1 


3.0e- 


3 


5.5e- 


-5 


l.lE + 1 


2880 


1393 


1400 


2.1E+1 


2.0E+0 


1.2e- 


2 


2.4e- 


5 


5.5E+1 


5120 


1886 


1850 


5.5E+1 


5.4E+0 


2.7e- 


2 


1.3e- 


5 


1.3e+2 


11520 


2750 


2754 


1.6E+2 


1.7E + 1 


7.2e- 


2 


6.2e- 


6 


3.5e+2 


20480 


3592 


3551 


3.7E+2 


4.1E + 1 


1.5e- 


1 


3.3e- 


6 


6.9E+2 



for e ^ 10-^ and 222 MB for e = 10"^ 

In 3D, our solver has complexity 0(iV^/^). Hence, asymptotics dictate that it 
must eventually lose. However, our results demonstrate that even up to = 20480, 
the solver remains surprisingly competitive. For example, at A'' = 20480, Trs — 409 s, 
while TpMM — 131 s with tifmm = 3; at e = 10~^, the difference is almost negligible: 
Trs = 850 s, TpMM = 839 s, npuM = 5. Thus, our algorithm remains a viable 
alternative for medium-scale problems. It is important to note that the solve time 
advantage is not lost even for large N, since the cost of each solve is only 0{N\ogN). 
In fact, the advantage is, remarkably, even more striking than in 2D: at iV = 20480, 
Tfmm/T,v - 1000; for e = lO"", Tfmm/TIv ~ 2500. 

6.2.2. The Helmholtz equation. We then considered the Helmholtz equation 

(A + A:^) u = inn, u = f on dn, 



recast as a boundary integral equation (2.10), with Green's function (6.1) in 2D 



and (6.2 1 in 3D. This representation does not work for all frequencies, encountering 
spurious discrete resonances for k beyond a critical value. We ignore that (well- 
understood) issue here and assume that the integral equation we obtain is invertible. 
The method itself does not falter in such cases, as discussed in |35j . 

We used the same geometries and precisions as for the Laplace equation. In 2D, 
the double-layer kernel is weakly singular, so we modified the trapezoidal rule with 
tenth-order endpoint corrections [3^. The frequency was set to w = 10 in 2D and 
w = 3.18 in 3D. 



Timing results are shown in Fig. 6.4 The data are very similar to that for 



the Laplace equation, but with the direct solver actually beating FMM/GMRES in 
2D. This is because the number of iterations required for FMM/GMRES scales as 
"-FMM = ^{^)- Interestingly, even at moderately high frequencies, where we would 
expect the direct solver to break down as discussed in §6.1[ the performance drop is 
more than compensated for by the increase in the number riFMM of iterations. In 
short, we find that recursive skeletonization is faster than FMM/GMRES at low to 
moderate frequencies, provided that the memory requirement is not excessive. 

The story is much the same in 3D and the compressed solve time is again very 
fast: at iV = 20480, Tfmm/Tsv ^ 2000. 

6.3. Molecular electrostatics. An important application area for our solver is 
molecular electrostatics. A simplified model for this involves consideration of a molec- 
ular surface E, dividing M.^ into fli and 0,2, denoting the molecule and the solvent, 
respectively. We also suppose that the molecule has interior charges of strengths qi at 
locations e ili for i = 1, . . . , n. The electrostatic potential ip (ignoring salt effects 
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10^ 10* lO'^ 10^ 

N 



o— o LP pc □— □ LP sv «— ♦ FMM •— • RS pc ■— ■ RS sv 



Fig. 6.4. CPU times for solving the Helmholtz equation in various cases at low frequency 
(lj = 10 in 2D and u) = 3.18 in 3D) using LAPACK/ATLAS, FMM/GMRES, and recursive skele- 
tonization; notation as in Fig. |6. 3[ The precision was set to e = 10~^ in 2D and e = 10~^ in 



in the solvent) then satisfies the Poisson equation 

n 

-V • [e (x) V(p (x)] = ^ QiS {x - Xi) , 

i=l 

where 

, , _ J £i if x e ill, 
^ " \ £2 if a; e ^2 

is a piecewise constant dielectric. We decompose the solution as f — + tpp, where 
ifs is the potential due to the sources: 



(6.3) 



1 " 

(x) = — y^qiG{x,Xi) 



with G given by (2.9), and (pp is a piecewise harmonic potential, satisfying the jump 
conditions 



[fp] = 0, 



dv 



dips_ 
dv 



on S, where u is the unit outer normal. We can write (fp, called the polarization 
response., as a single-layer potential |24| 



ipp{x) = / G{x,y)a{y)dy, 



(6.4) 

which yields the boundary integral equation 
1 



-a (x) + X 



^ix,y).iy)dy^-x'^ix), 

OVr OV 



where A = (ei — e2)/(ei + £2), in terms of the polarization charge a. Once a has been 
computed, the potential at any point can be evaluated using (6.31 and (6.4). 
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We generated a molecular surface for a short segment of DNA [T51 PDB ID: 
IBNA] consisting of iV = 19752 triangles using MSMS [12] ■ Strengths were assigned 
to each of n = 486 heavy atoms using Amber partial charges [5J through PDB2PQR 
[14] . The system was solved with ei = 20 and £2 = 80 at precision e = lO^''; the 
resulting potential ip on Tt is shown in Fig. |6.5| The net solution time was T^g = 592 
s, with an inverse application time of Tgv = 0.08 s, to be compared with Tfmm = 27 
s using FMM/GMRES. Thus, when sampling the electrostatic field for many charge 
configurations {qi}, as is common in computational chemistry (e.g., [3]), our solver 
can provide a speedup provided that the number of such configurations is greater 



than ^ 25. We remark that the evaluation of tp at fixed points, e.g., on E, via (6.31 



and (6.4) can also be accelerated using our algorithm in its capacity as a generalized 



FMM; the computation time for this would be similar to Tsv 

6.4. Multiple scattering. As a final example, we show how direct solvers can 
be combined with FMM-based iterative methods to great effect in the context of 
a multiple scattering problem. For this, let il^, for i — 1, . . . ,p, be a collection of 
acoustic scatterers in 2D with boundaries Ej . Then, using the language of acoustics, 
the pressure field satisfies 



(6.5) 



p 

{A + k^)u = in \ y Qi. 
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Assuming that the obstacles are sound-hard, we must compute the exterior solution 
that satisfies the Neumann boundary condition 



du 
dv 



on y 



If w = Ui + Us, where Ui is an incoming field satisfying (6.5), then the scattered field 
Ug also satisfies (6.5) with boundary condition 



dv 



dv 



on 



We write the scattered field as Us = X^iLi where 



Us,t {x) = I G{x,y)a^ iy)dy, 



where G is the single-layer kernel (6.1). Imposing the boundary condition yields the 
second-kind integral equation 



1 ^ 



duj 
dv 



on Ei, i = l,...,p, 



Si 



where 



Kij(Jj (x)^ [ ^ {x, y) aj (y) 



dy for X Tii 



In operator notation, the linear system therefore has the form 



duj 
dv 



A, 



s. 



Kii if i = j, 
if i ^ j. 



We solve this system using FMM/GMRES with the block diagonal preconditioner 



^11 



A-^ 
^pp 



where each is computed using recursive skeletonization; observe that A~j^ is 
precisely the solution operator for scatterer Hi in isolation. The question is whether 
this preconditioner will significantly reduce the iteration count required, which is 
typically quite high for problems of appreciable size. 

As a test, we embedded two identical scatterers, each described in polar coor- 
dinates by the radial function r = [2 -|- cos(36')]/6, where 6 is the polar angle; each 
scatterer is smooth, though somewhat complicated, and was taken to be ten wave- 
lengths in size. We assumed an incoming field given by the plane wave Ui = e^'^^^, 
where x — {xi,X2), and considered the scattering problem at various horizontal sepa- 
ration distances S between the centers of the scatterers. Each configuration was solved 
both with and without the preconditioner to precision e = 10~^; each scatterer 
was discretized using a corrected trapezoidal rule [30 with N = 1024 points. 
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5/A = 30 5/ A = 20 




Fig. 6.6. Instantaneous intensity [5R(u)]^ of the pressure field in response to an incoming 
vertical plane wave for various scattering geometries characterized by the separation distance 5/\ 
in wavelengths between the centers of two identical scatterers. 



Table 6.7 

Numerical results for the multiple scattering example, consisting of six configurations with var- 
ious separation distances 5/\, relative to the wavelength, between the centers of two identical scat- 
terers, solved to precision e = 10~^; TpMM> time for FMM/GMRES solve (s); T^is, time for 
preconditioned FMM/GMRES solve (s); npMM, number of iterations required for FMM/GMRES; 
nj^g , number of iterations required for preconditioned EMM/ GMRES; E, relative error; Ton , matrix 
compression time for scatterer (s); Tiu, sparse matrix factorization time for scatterer (s). 



(5/A 


TpMM 




n-PMM 




E 




30.0 


7.9E+1 


8.9E-1 


697 


8 


1.3e- 


^8 


20.0 


7.7E+1 


l.lE+0 


694 


10 


5.8e- 


9 


15.0 


8.0B+1 


1.2E+0 


695 


11 


6.9e- 


9 


12.5 


7.9E+1 


1.3E+0 


695 


12 


7.8e- 


-9 


11.0 


7.9E+1 


1.4E+0 


704 


14 


8.7e- 


-9 


10.5 


8.0E+1 


1.5E+0 


706 


14 


1.3e- 


8 






6.6E-1 














9.3E-2 










total 


4.7e+2 


S.Ie+O 











The intensities of the resuhing pressure fields are shown in Fig. 6.6 with numerical 
data given in Table [6771 It is clear that the preconditioner is highly effective: following 
a precomputation time of 0.76 s to construct , which is amortized over all solves, 
the number of iterations required was decreased from n-puM ^ 700 to just n^g ~ 10 
for each case. As expected, more iterations were necessary for smaller 6, though the 
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difference was not too dramatic. The ratio of the total solution time required for all 
solves was ^ 60 for the unpreconditioned versus the preconditioned method. 

7. Generalizations and conclusions. We have presented a multilevel matrix 
compression algorithm and demonstrated its efficiency at accelerating matrix-vector 
multiplication and matrix inversion in a variety of contexts. The matrix structure 
required is fairly general and relies only on the assumption that the matrix have low- 
rank off-diagonal blocks. As a fast direct solver for the boundary integral equations of 
potential theory, we found our algorithm to be competitive with fast iterative methods 
based on FMM/GMRES in both 2D and 3D, provided that the integral equation kernel 
is not too oscillatory, and that the system size is not too large in 3D. In such cases, 
the total solution times for both methods were very comparable. Our solver has clear 
advantages, however, for problems with ill-conditioned matrices (in which case the 
number of iterations required by FMM/GMRES can increase dramatically), or those 
involving multiple right-hand sides (in which case the cost of matrix compression and 
factorization can be amortized). The latter category includes the use of our solver 
as a preconditioner for iterative methods, which we expect to be quite promising, 
particularly for large-scale 3D problems with complex geometries. 

A principal limitation of the approach described here is the growth in the cost 
of factorization in 3D or higher, which prohibits the scheme from achieving optimal 
0{N) or nearly optimal 0{N log N) complexity. It is, however, straightforward to 
implement and quite effective. All of the hierarchical compression-based approaches 
(HSS matrices [lllTllin], 'H-matrices [351 HZl [2H] and skeletonization [T7 1 [2T | [55 ] ) are 
capable of overcoming this obstacle. The development of simple and effective schemes 
that curtail this growth is an active area of research, and we expect that 0{N log N) 
direct solvers with small pre-factors in higher dimensions will be constructed shortly, 
at least for non-oscillatory problems. It is clear that all of these techniques pro- 
vide improved solution times for high-frequency volume integral equations, due to 
the compression afforded by Green's theorem in moving data from the volume to the 
boundary. More precisely, the cost of solving high-frequency volume wave scatter- 
ing problems in 2D are 0{N^^^) and 0{N log N) for precomputation and solution, 
respectively. For related work, see [H ST] . 

Finally, although all numerical results have presently been reported for a single 
processor, the algorithm is naturally parallelizable: many computations are organized 
in a block sweep structure, where each block can be processed independently. This is 
clearly true of the recursive skeletonization phase using proxy surfaces (with a possible 
loss of 0{logN) in performance since there are 0{logN) levels in the hierarchy). 
As for the solver phase, arguments can be made in support of both the original 
"hand-rolled" Gaussian elimination approach and our framework that relies on sparse 
embedding. We expect that, by making use of UMFPACK and other state-of-the-art 
parallel sparse solvers (e.g., SuperLU gl], MUMPS [J, Pardiso 03], WSMP [25]), 
our overall strategy will help simplify the implementation of skeletonization-based 
schemes on high-performance computing platforms as well. 

Acknowledgements. We would like to thank Zydrunas Gimbutas and Mark 
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